2  Preparing Data for Tables

Now that we’ve covered the main types of variables, we can start thinking about how to prepare and visually represent data with figures and tables. In this section, we will focus on preparing data for tables.

In R, datasets are stored as dataframes, which are objects made up of rows and columns. Each row corresponds to an individual observation, while each column represents the variable that is being measured. There are several ways R stores variables: character for text, factor for categorical, integer for whole numbers, numeric for decimals, and logical for booleans (TRUE/FALSE).

Cleaning data can be the most time-consuming, yet arguably the most important, part of analysing data! Generally, it is

To understand R, we will examine these concepts using real-world datasets.

NAPLAN

Photo by N. K. on Unsplash

The first dataset to be explored is the NAPLAN dataset. NAPLAN stands for the National Assessment Program - Literacy and Numeracy, and is an annual national assessment for all students in year 3, 5, 7, and 9. The assessment is designed to evaluate whether students are building essential literacy and numeracy skills (https://www.nap.edu.au/naplan/faqs/naplan–general).

To load in the dataset, we can use the read.csv() function.

naplan <- read.csv("Data-sets/naplan_reading.csv")

Try doing this yourself.

naplan <- read.csv("Data-sets/naplan_reading.csv")
naplan <- read.csv("Data-sets/naplan_reading.csv")

Now that we have the dataset loaded in, let’s take a look at the variables in the dataset using str().

str(naplan)
str(naplan)

In the top line, we can see that there are 3000 observations and 11 variables.

We can confirm this using the dim() function

dim(naplan)
dim(naplan)

Using the names() function, we can see the variable names.

names(naplan)
names(naplan)

The first column gives us the variables, the second column gives us the type of variable, and the third column gives us individual observations for each row. However, we can see school_type is stored as a chr or character (text) variable. Text variables are very common when you encounter surveys that have customer feedback, for example, as each response will be unique from one another.

It is important to check that your variables are stored correctly. We can see certain variables like reading time at home is an integer, which matches our expectations.

You can use the unique() function to see the unique groupings or labels within a categorical variable. In our case, we can use it to confirm if our suspicions that school_type is categorical. To call a specific variable, we call the dataset followed by $ and then the variable name.

unique(naplan$school_type)
[1] "Government"  "Independent" "Catholic"   

There are only 3 unique labels, which strongly suggests that school type is a nominal categorical variable. If it were truly a text variable, we might see thousands of unique labels.

We can manually convert the variable to categorical using as.factor():

naplan$school_type <- as.factor(naplan$school_type)
str(naplan)
'data.frame':   3000 obs. of  11 variables:
 $ student_id          : int  1 2 3 4 5 6 7 8 9 10 ...
 $ school_id           : int  51 26 14 21 33 26 37 41 8 60 ...
 $ grade               : chr  "Year 7" "Year 3" "Year 7" "Year 3" ...
 $ reading_time_home   : int  70 70 30 0 20 25 5 90 45 10 ...
 $ parent_education    : chr  "Certificate/Diploma" "Year 12" "Bachelor degree" "Bachelor degree" ...
 $ school_type         : Factor w/ 3 levels "Catholic","Government",..: 2 2 2 3 2 2 2 2 2 2 ...
 $ gender              : chr  "Female" "Female" "Male" "Female" ...
 $ birth_months        : chr  "Aug" "Apr" "Jul" "Sep" ...
 $ n_siblings          : int  2 1 1 1 0 1 3 2 1 3 ...
 $ ses_index           : num  0.87 0.11 1.14 0.66 0.27 -0.95 -0.65 1.01 -0.6 -1.14 ...
 $ naplan_reading_score: int  724 439 631 491 635 592 449 484 730 528 ...

Now we can see school_type is a factor with 3 levels (or 3 unique labels).

Alternatively, we can also convert all strings (text) to categorical variables with the stringsAsFactors argument when we read in the dataset.

naplan <- read.csv("Data-sets/naplan_reading.csv",
                   stringsAsFactors = TRUE)
str(naplan)
'data.frame':   3000 obs. of  11 variables:
 $ student_id          : int  1 2 3 4 5 6 7 8 9 10 ...
 $ school_id           : int  51 26 14 21 33 26 37 41 8 60 ...
 $ grade               : Factor w/ 4 levels "Year 3","Year 5",..: 3 1 3 1 4 4 1 3 4 3 ...
 $ reading_time_home   : int  70 70 30 0 20 25 5 90 45 10 ...
 $ parent_education    : Factor w/ 5 levels "Bachelor degree",..: 2 5 1 1 5 4 4 1 4 4 ...
 $ school_type         : Factor w/ 3 levels "Catholic","Government",..: 2 2 2 3 2 2 2 2 2 2 ...
 $ gender              : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 1 1 1 2 ...
 $ birth_months        : Factor w/ 12 levels "Apr","Aug","Dec",..: 2 1 6 12 3 10 2 12 11 8 ...
 $ n_siblings          : int  2 1 1 1 0 1 3 2 1 3 ...
 $ ses_index           : num  0.87 0.11 1.14 0.66 0.27 -0.95 -0.65 1.01 -0.6 -1.14 ...
 $ naplan_reading_score: int  724 439 631 491 635 592 449 484 730 528 ...

As seen above, all the chr variables have been converted to Factor, which is how the variables should be stored. Note that there might be times where you need a variable stored as chr, so you should always be sure of what you’re checking.

We can now attach the dataset. Attaching the dataset means we can call variables directly without using $. For instance, we can call school_type directly as opposed to naplan$school_type.

attach(naplan)

If we type naplan directly in, we can see the content.

naplan

Alternatively, we can use the View() function to open up a new tab in RStudio with the dataset.

View(naplan)

We can also select parts of the dataframe. This can be very useful for deciding which data you want to include a table. In R, we subset by calling the dataset followed by [,] . Writing a blank space before the comma indicates that all rows are includes, while a blank space after the comma indicates that all columns are included. For example, [,3] selects the third column of a dataframe. If we want to select the first three columns, we can write:

Try selecting the first three rows and first three columns of the naplan dataset. You select a number of rows or columns using :. For example, to select rows 3-4, simply type naplan[3:4,]

naplan[1:3,1:3]
naplan[1:3,1:3]

Now try selecting only rows 5-6.

naplan[5:6,]
naplan[5:6,]

We can even select rows based on logical tests on the values of one or more variables.

For instance, let’s say we want all observations where reading time is greater than 45 mins. We can do this by typing naplan[reading_time_home>45, ].

Try having a go below by selecting rows with reading times greater than 30 mins AND where students havea reading score greater than 800. Hint; you will need to use the logical operator &

naplan[reading_time_home > 30 & naplan_reading_score > 860,]
naplan[reading_time_home > 30 & naplan_reading_score > 860,]

The dyplr package can simplify this process. Some helpful functions include select, mutate, and filter. You can load in tidyverse which has dplyr built in, or load in dyplr directly.

library(tidyverse)

Let’s filter our data by Year 3 students with reading times at home above 100 mins.

naplan %>%
  filter(grade == "Year 3" & reading_time_home > 100)
  student_id school_id  grade reading_time_home parent_education school_type
1        976         3 Year 3               120  Bachelor degree  Government
2       1005        44 Year 3               120  Bachelor degree Independent
3       1512        38 Year 3               115 Year 10 or below  Government
4       2378        34 Year 3               115          Year 12    Catholic
5       2473        52 Year 3               105  Bachelor degree  Government
  gender birth_months n_siblings ses_index naplan_reading_score
1 Female          Aug          3      0.24                  539
2   Male          Mar          2      2.20                  600
3   Male          Aug          3     -1.21                  533
4 Female          Jul          3      0.07                  319
5 Female          Sep          1      1.02                  455

As we saw above, student_id and school_id is stored as integers. As these numbers actually represent unique labels, they should be treated as categorical. We can mutate student_id and school_id to factor variables.

naplan <- naplan %>%
  mutate(across(c(school_id, student_id), as.factor))
str(naplan)
'data.frame':   3000 obs. of  11 variables:
 $ student_id          : Factor w/ 3000 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ school_id           : Factor w/ 60 levels "1","2","3","4",..: 51 26 14 21 33 26 37 41 8 60 ...
 $ grade               : Factor w/ 4 levels "Year 3","Year 5",..: 3 1 3 1 4 4 1 3 4 3 ...
 $ reading_time_home   : int  70 70 30 0 20 25 5 90 45 10 ...
 $ parent_education    : Factor w/ 5 levels "Bachelor degree",..: 2 5 1 1 5 4 4 1 4 4 ...
 $ school_type         : Factor w/ 3 levels "Catholic","Government",..: 2 2 2 3 2 2 2 2 2 2 ...
 $ gender              : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 1 1 1 2 ...
 $ birth_months        : Factor w/ 12 levels "Apr","Aug","Dec",..: 2 1 6 12 3 10 2 12 11 8 ...
 $ n_siblings          : int  2 1 1 1 0 1 3 2 1 3 ...
 $ ses_index           : num  0.87 0.11 1.14 0.66 0.27 -0.95 -0.65 1.01 -0.6 -1.14 ...
 $ naplan_reading_score: int  724 439 631 491 635 592 449 484 730 528 ...

We can combine several of these pipelines together. For example, we can select student_id and school_id, rename student_id to student and school_id to school, then mutate them back to integers and store the result in a new dataset called naplan_new.

naplan_new <- naplan %>%
  select(student = student_id,
         school = school_id) %>%
  mutate(across(c(school, student), as.integer))

str(naplan_new)
'data.frame':   3000 obs. of  2 variables:
 $ student: int  1 2 3 4 5 6 7 8 9 10 ...
 $ school : int  51 26 14 21 33 26 37 41 8 60 ...

We can summarise the content of the dataframe as well.

summary(naplan)
   student_id     school_id       grade     reading_time_home
 1      :   1   38     :  87   Year 3:735   Min.   :  0.00   
 2      :   1   20     :  83   Year 5:771   1st Qu.:  5.00   
 3      :   1   41     :  83   Year 7:745   Median : 15.00   
 4      :   1   26     :  81   Year 9:749   Mean   : 21.62   
 5      :   1   60     :  81                3rd Qu.: 30.00   
 6      :   1   48     :  80                Max.   :120.00   
 (Other):2994   (Other):2505                                 
            parent_education      school_type      gender      birth_months 
 Bachelor degree    :771     Catholic   : 540   Female:1451   Dec    : 325  
 Certificate/Diploma:701     Government :2023   Male  :1549   Aug    : 314  
 Postgraduate       :297     Independent: 437                 Jan    : 300  
 Year 10 or below   :466                                      Sep    : 299  
 Year 12            :765                                      Jul    : 296  
                                                              Nov    : 296  
                                                              (Other):1170  
   n_siblings      ses_index       naplan_reading_score
 Min.   :0.000   Min.   :-2.3200   Min.   :177.0       
 1st Qu.:1.000   1st Qu.:-0.1100   1st Qu.:491.0       
 Median :2.000   Median : 0.5300   Median :562.0       
 Mean   :1.899   Mean   : 0.5246   Mean   :558.5       
 3rd Qu.:3.000   3rd Qu.: 1.1500   3rd Qu.:618.2       
 Max.   :5.000   Max.   : 3.4100   Max.   :900.0       
                                                       

You might notice that the summary doesn’t quite work for categorical variables.

We can use tapply() and with() to summarise a quantitative variable by a qualitative variable.

with(naplan, tapply(naplan_reading_score, grade, mean))
  Year 3   Year 5   Year 7   Year 9 
457.1687 539.7821 594.5785 641.1722 

aggregate() allows you to summarise by one or more categorical variables, while tapply can only summarise one variable at a time.

aggregate(naplan[,c(4,11)], list(grade),mean)
  Group.1 reading_time_home naplan_reading_score
1  Year 3          21.59184             457.1687
2  Year 5          21.83528             539.7821
3  Year 7          22.24832             594.5785
4  Year 9          20.80774             641.1722
aggregate(naplan[,c(4,11)], list(Grade = grade),mean)
   Grade reading_time_home naplan_reading_score
1 Year 3          21.59184             457.1687
2 Year 5          21.83528             539.7821
3 Year 7          22.24832             594.5785
4 Year 9          20.80774             641.1722
aggregate(naplan[,c(4,11)], list(Grade = grade, "Parent Education" = parent_education),mean)
    Grade    Parent Education reading_time_home naplan_reading_score
1  Year 3     Bachelor degree          21.35000             478.5500
2  Year 5     Bachelor degree          19.63054             564.7783
3  Year 7     Bachelor degree          21.15591             618.3065
4  Year 9     Bachelor degree          18.48901             653.3187
5  Year 3 Certificate/Diploma          18.17919             457.6127
6  Year 5 Certificate/Diploma          21.78977             536.1705
7  Year 7 Certificate/Diploma          23.15217             595.2554
8  Year 9 Certificate/Diploma          22.44048             642.1786
9  Year 3        Postgraduate          20.06494             487.3117
10 Year 5        Postgraduate          20.19481             575.4026
11 Year 7        Postgraduate          23.10606             628.7727
12 Year 9        Postgraduate          17.20779             684.5325
13 Year 3    Year 10 or below          22.47475             431.2323
14 Year 5    Year 10 or below          25.49587             505.1983
15 Year 7    Year 10 or below          21.30435             558.1391
16 Year 9    Year 10 or below          21.48855             609.4809
17 Year 3             Year 12          25.18817             435.0914
18 Year 5             Year 12          22.55155             524.3351
19 Year 7             Year 12          22.70619             581.1546
20 Year 9             Year 12          22.56545             632.9686

Exercise: Palmer Penguins

Photo by Derek Oyen on Unsplash

First, we will need to load in some data into R. We will be working with the palmerpenguins package. The palmerpenguins package is a dataset collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

If you have not used this package before, you will need to install it first.

install.packages("palmerpenguins")

Once installed, you will need to load the package into R.

# Load in the `palmerpenguins` package library(palmerpenguins)
# Load in the `palmerpenguins` package
library(palmerpenguins)

Now that we have the package installed, let’s take a look at the variables in the dataset using str() on the penguins dataset.

# Call str() on `penguins` str(penguins)
# Call str() on `penguins`
str(penguins)

You might notice that some values are listed as NA. This means there are missing observations. To check for missing values, try using the embedded any(is.na()) function.

# Try using any(is.na()) on the `penguins` dataset. any(is.na(penguins))
# Try using any(is.na()) on the `penguins` dataset.
any(is.na(penguins))

Since this returns TRUE, we know there are missing values in the dataset. Let’s find out how many.

# Try using sum(is.na()) on the `penguins` dataset. sum(is.na(penguins))
# Try using sum(is.na()) on the `penguins` dataset.
sum(is.na(penguins))

There are missing observations. This can cause problems when generating numerical summaries or running statistical tests. At this stage, to handle this, we can remove the missing values. It’s always best practice to store your cleaned data in a new dataset and leave the original untouched. Note that you should always be careful when removing observations, as even partially filled observations can still provide valuable information.

clean_penguins <- na.omit(penguins)
clean_penguins <- na.omit(penguins)

We can check the structure again to make sure everything looks good:

str(clean_penguins)
tibble [333 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:333] 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
 $ bill_depth_mm    : num [1:333] 18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
 $ flipper_length_mm: int [1:333] 181 186 195 193 190 181 195 182 191 198 ...
 $ body_mass_g      : int [1:333] 3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 1 2 1 2 1 2 2 ...
 $ year             : int [1:333] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
 - attr(*, "na.action")= 'omit' Named int [1:11] 4 9 10 11 12 48 179 219 257 269 ...
  ..- attr(*, "names")= chr [1:11] "4" "9" "10" "11" ...

We can see that there are eight variables. Three of these are categorical: species, island, and sex. These are stored as factors, where the levels represent the number of distinct labels or groups within each variable. For example, sex has two levels: male and female.

You’ll notice that there are four numerical variables: bill_length_mm, bill_depth_mm, flipper_length_mm, and year. If your instinct is that year doesn’t quite fit as right as a numerical variable, you would be right! It is better to treat year as a categorical variable.

How do we manually convert year to a factor?

clean_penguins$year <- as.factor(clean_penguins$year)
clean_penguins$year <- as.factor(clean_penguins$year)

Now if we run class(), we can see that year is now correctly stored as a categorical variable with three levels.

class(clean_penguins$year)
[1] "factor"

How can we check the unique groupings or labels within year?

unique(clean_penguins$year)
unique(clean_penguins$year)

This shows that the data was collected over the years , , and (ascending order).

You can also see from the structure output str() that the penguins dataset is stored as a 333 x 8 tibble, which is a type of data frame in R. This means that there are 333 rows (observations) and 8 columns (variables).

Check the dimensions of the cleaned dataset to confirm this.

dim(clean_penguins)
dim(clean_penguins)

If you wish to have a closer look at inspecting the penguins dataset, you can use the View() function to open it in a new tab on RStudio.

View(penguins)
View(clean_penguins)

Tables

Now that we’ve inspected the dataset, we can start thinking about how to visually represent and understand the data.

As we have learnt

library(gt)
library(tidyverse)
library(kableExtra)
naplan %>% 
  dplyr::filter(reading_time_home > 30 & naplan_reading_score > 800) %>%
  gt() %>%
  tab_header(title = "Naplan") %>%
  fmt_number(decimals=2) %>%
  cols_label(
    student_id = md("***Student ID***"),
    school_id = md("***School ID***"),
    grade = md("***Grade***"),
    reading_time_home = md("***Reading Time***"),
    parent_education = md("***Parent Education***"),
    school_type = md("***School Type***"),
    gender = md("***Gender***"),
    birth_months = md("***Birth Month***"),
    n_siblings = md("***Number of Siblings***"),
    ses_index = md("***SES Index***"),
    naplan_reading_score = md("***Reading Score***")
  )
Naplan

Student ID

School ID

Grade

Reading Time

Parent Education

School Type

Gender

Birth Month

Number of Siblings

SES Index

Reading Score

493 1 Year 9 65.00 Year 12 Independent Male Aug 0.00 1.05 900.00
584 1 Year 9 40.00 Certificate/Diploma Independent Female Sep 1.00 0.68 900.00
626 35 Year 9 40.00 Bachelor degree Government Female Dec 2.00 1.16 804.00
741 42 Year 9 105.00 Year 12 Government Male Aug 5.00 −0.44 900.00
1118 42 Year 9 95.00 Certificate/Diploma Government Male Jul 2.00 −0.28 883.00
1168 31 Year 9 60.00 Postgraduate Catholic Female Aug 2.00 2.19 871.00
1174 35 Year 9 65.00 Year 12 Government Male Aug 2.00 −0.33 900.00
1447 23 Year 9 75.00 Year 12 Independent Male Sep 2.00 0.30 900.00
1792 5 Year 9 70.00 Year 12 Government Male Aug 1.00 0.52 900.00
1906 5 Year 9 70.00 Year 10 or below Government Male Jan 5.00 −0.71 887.00
1952 35 Year 9 75.00 Year 12 Government Male Aug 1.00 −0.17 900.00
2122 5 Year 9 55.00 Postgraduate Government Male Jul 3.00 1.87 900.00
2359 35 Year 9 60.00 Bachelor degree Government Female Dec 5.00 1.33 900.00
2425 47 Year 9 50.00 Postgraduate Independent Female Aug 2.00 2.02 866.00
2691 55 Year 9 80.00 Year 12 Government Female Dec 2.00 −0.18 837.00
2712 31 Year 9 60.00 Year 12 Catholic Male Dec 2.00 0.81 809.00
2756 53 Year 9 55.00 Year 10 or below Government Female Mar 1.00 −0.40 840.00
naplan %>% 
  filter(gender == "Female", school_type == "Independent") %>%
  arrange(desc(naplan_reading_score)) %>%
  slice_head(n = 10) %>%
  gt() %>%
  tab_header(title = "Naplan") %>%
  fmt_number(decimals = 2) %>%
  cols_label(
    student_id = md("***Student ID***"),
    school_id = md("***School ID***"),
    grade = md("***Grade***"),
    reading_time_home = md("***Reading Time***"),
    parent_education = md("***Parent Education***"),
    school_type = md("***School Type***"),
    gender = md("***Gender***"),
    birth_months = md("***Birth Month***"),
    n_siblings = md("***Number of Siblings***"),
    ses_index = md("***SES Index***"),
    naplan_reading_score = md("***Reading Score***")
  ) %>%
  data_color(
    columns = naplan_reading_score,
    method = "numeric",
    palette = "viridis",
    domain = c(0, 1000),
    reverse = TRUE
  ) %>%
  tab_style(
    style = list(
      cell_fill(color = "gray95"),
      cell_borders(sides = c("l", "r"), color = "gray50", weight = px(3))
    ),
    locations = cells_body(columns = -naplan_reading_score)
  ) %>%
  tab_style(
    style = cell_fill(color = "gray98"),
    locations = cells_title()
  ) %>%
  tab_style(
    style = list(cell_fill(color = "gray35"), cell_text(color = "white")),
    locations = list(cells_footnotes(), cells_source_notes())
  ) %>%
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_title(groups = "title")
  ) %>% 
  tab_style(
    style = cell_borders(
      sides = c("t", "b"),
      color = "darkgrey",
      weight = px(3)
    ),
    locations = list(cells_column_labels(), cells_stubhead())
  ) %>%
tab_footnote(
    footnote = "SES Index represents ",
    locations = cells_column_labels(columns = ses_index)
  )
Naplan

Student ID

School ID

Grade

Reading Time

Parent Education

School Type

Gender

Birth Month

Number of Siblings

SES Index

1

Reading Score

584 1 Year 9 40.00 Certificate/Diploma Independent Female Sep 1.00 0.68 900.00
2425 47 Year 9 50.00 Postgraduate Independent Female Aug 2.00 2.02 866.00
1007 1 Year 9 25.00 Bachelor degree Independent Female Apr 0.00 1.09 831.00
887 1 Year 7 50.00 Bachelor degree Independent Female Jul 1.00 0.85 800.00
2166 2 Year 7 60.00 Bachelor degree Independent Female May 1.00 2.40 800.00
2610 23 Year 9 35.00 Bachelor degree Independent Female Oct 1.00 1.44 796.00
305 1 Year 9 35.00 Certificate/Diploma Independent Female Apr 3.00 0.46 788.00
1220 47 Year 9 35.00 Certificate/Diploma Independent Female Jul 5.00 0.76 782.00
133 47 Year 9 55.00 Year 10 or below Independent Female Sep 3.00 0.36 780.00
1383 47 Year 9 30.00 Certificate/Diploma Independent Female Sep 2.00 1.53 774.00
1 SES Index represents
naplan |>
  dplyr::group_by(gender) |>
  dplyr::summarise(
    mean_score = mean(naplan_reading_score, na.rm = TRUE),
    mean_ses   = mean(ses_index, na.rm = TRUE),
    .groups = "drop"
  ) |>
  tidyr::pivot_wider(
    names_from = gender,
    values_from = c(mean_score, mean_ses)
  ) |>
  gt() |>
  fmt_number(columns = everything(), decimals = 2) |>
  tab_spanner(
    label = "Female",
    columns = matches("_Female$")
  ) |>
  tab_spanner(
    label = "Male",
    columns = matches("_Male$")
  ) |>
  cols_label(
    matches("mean_score") ~ "Reading Score<br>Average",
    matches("mean_ses") ~ "SES Index<br>Average",
    .fn=md
  ) |>
  cols_width(everything() ~ px(120))
Female Male

Reading Score
Average

SES Index
Average

Reading Score
Average

SES Index
Average

563.00 0.50 554.21 0.55
library(gtsummary)
naplan |>
  tbl_summary(
    by = gender,
    include=c(grade, reading_time_home, naplan_reading_score),
    statistic=list(all_continuous()~"{mean} ({sd})"),
    digits=list(naplan_reading_score = c(0,1),
                reading_time_home = c(0,1))
  )

Characteristic

Female
N = 1,451

1

Male
N = 1,549

1
grade

    Year 3 350 (24%) 385 (25%)
    Year 5 378 (26%) 393 (25%)
    Year 7 356 (25%) 389 (25%)
    Year 9 367 (25%) 382 (25%)
reading_time_home 22 (22.9) 21 (22.7)
naplan_reading_score 563 (95.7) 554 (92.0)
1

n (%); Mean (SD)

naplan |>
  tbl_summary(
    include=c(reading_time_home, naplan_reading_score),
    type=all_continuous() ~ "continuous2",
    statistic=list(all_continuous()~c("{median} ({p25}, {p75})", "{min}, {max}"))
    )

Characteristic

N = 3,000

reading_time_home
    Median (Q1, Q3) 15 (5, 30)
    Min, Max 0, 120
naplan_reading_score
    Median (Q1, Q3) 562 (491, 619)
    Min, Max 177, 900
naplan %>%
  filter(gender == "Female", school_type == "Independent") %>%
  arrange(desc(naplan_reading_score)) %>%
  slice_head(n = 10) %>%
  select(
    `Student ID` = student_id,
    `School ID` = school_id,
    Grade = grade,
    `Reading Time` = reading_time_home,
    `Parent Education` = parent_education,
    `School Type` = school_type,
    Gender = gender,
    `Birth Month` = birth_months,
    `Number of Siblings` = n_siblings,
    `SES Index` = ses_index,
    `Reading Score` = naplan_reading_score
  ) %>%
  mutate(
    `Reading Score` = cell_spec(
      `Reading Score`,
      background = spec_color(`Reading Score`, option = "viridis", end = 0.95),
      color = "white"
    )
  ) %>%
  kable(
    format = "html",
    align = "c",
    booktabs = TRUE,
    escape = FALSE,
    caption = "Naplan"
  ) %>%
  kable_styling(full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "gray35") %>%
  column_spec(
    1:10,
    background = "gray95",
    extra_css = "border-left: 3px solid gray50; border-right: 3px solid gray50;"
  )  %>%  footnote(general = "SES Index represents ...", general_title = "")
Naplan
Student ID School ID Grade Reading Time Parent Education School Type Gender Birth Month Number of Siblings SES Index Reading Score
584 1 Year 9 40 Certificate/Diploma Independent Female Sep 1 0.68 900
2425 47 Year 9 50 Postgraduate Independent Female Aug 2 2.02 866
1007 1 Year 9 25 Bachelor degree Independent Female Apr 0 1.09 831
887 1 Year 7 50 Bachelor degree Independent Female Jul 1 0.85 800
2166 2 Year 7 60 Bachelor degree Independent Female May 1 2.40 800
2610 23 Year 9 35 Bachelor degree Independent Female Oct 1 1.44 796
305 1 Year 9 35 Certificate/Diploma Independent Female Apr 3 0.46 788
1220 47 Year 9 35 Certificate/Diploma Independent Female Jul 5 0.76 782
133 47 Year 9 55 Year 10 or below Independent Female Sep 3 0.36 780
1383 47 Year 9 30 Certificate/Diploma Independent Female Sep 2 1.53 774
SES Index represents ...